The following programs are available for Humanities users:

accent    user-controlled accent module
cedilla   convert plus mark to cedilla
cfreq     character (or digraph) frequency count
dict      split file into dictionary sections
exclude   exclusion module for concordance
format    format and count keywords in concordance
freq      word frequency count
kwal      key word and line concordance
kwic      key word in context concordance
lemma     user-controlled lemmatization module *
lno       line number (also double lines, hemistichs, strophes)
maxwd     locate, measure and print longest word (or line)
pair      set two files side by side (or merge lines)
pause     stop terminal output to change type ball
revconc   reverse concordance module
sfind     find sentence (or record) matching a pattern
skel      prompt user with database skeleton *
togrk     convert Greek transcription for typesetter *
tolpr     filter output for lineprinter
tosel     convert English for Selectric terminal *
tprep     prepare text for concordance (trim, pad, or unpad)
troffmt   format concordance for typesetter *
umlaut    convert plus mark to umlaut
wdlen     tabulate word lengths and print histogram
wheel     roll through text a word cluster at a time
xref      cross reference words and linenumbers

* not widely distributed, but available on request

To get the manual pages for any of these programs, type:

% human programname
@@@ Fin de man/INDEX.man
echo man/Makefile
cat >man/Makefile <<'@@@ Fin de man/Makefile'
MAN = ../man/
all: $(MAN)index $(MAN)accent $(MAN)cfreq $(MAN)dict $(MAN)exclude $(MAN)format $(MAN)freq $(MAN)kwal $(MAN)kwic $(MAN)lno $(MAN)maxwd $(MAN)pair $(MAN)pause $(MAN)revconc $(MAN)skel $(MAN)sfind $(MAN)tolpr $(MAN)tosel $(MAN)tprep $(MAN)wdlen $(MAN)wheel $(MAN)xref
$(MAN)index: INDEX.man
	nroff -man INDEX.man > $(MAN)index
$(MAN)accent: accent.man
	nroff -man accent.man > $(MAN)accent
	ln $(MAN)accent $(MAN)cedilla
	ln $(MAN)accent $(MAN)umlaut
$(MAN)cfreq: cfreq.man
	nroff -man cfreq.man > $(MAN)cfreq
$(MAN)dict: dict.man
	nroff -man dict.man > $(MAN)dict
$(MAN)exclude: exclude.man
	nroff -man exclude.man > $(MAN)exclude
$(MAN)format: format.man
	nroff -man format.man > $(MAN)format
$(MAN)freq: freq.man
	nroff -man freq.man > $(MAN)freq
$(MAN)kwal: kwal.man
	nroff -man kwal.man > $(MAN)kwal
$(MAN)kwic: kwic.man
	nroff -man kwic.man > $(MAN)kwic
$(MAN)lno: lno.man
	nroff -man lno.man > $(MAN)lno
$(MAN)maxwd: maxwd.man
	nroff -man maxwd.man > $(MAN)maxwd
$(MAN)pair: pair.man
	nroff -man pair.man > $(MAN)pair
$(MAN)pause: pause.man
	nroff -man pause.man > $(MAN)pause
$(MAN)revconc: revconc.man
	nroff -man revconc.man > $(MAN)revconc
$(MAN)sfind: sfind.man
	nroff -man sfind.man > $(MAN)sfind
$(MAN)skel: skel.man
	nroff -man skel.man > $(MAN)skel
$(MAN)tolpr: tolpr.man
	nroff -man tolpr.man > $(MAN)tolpr
$(MAN)tosel: tosel.man
	nroff -man tosel.man > $(MAN)tosel
$(MAN)tprep: tprep.man
	nroff -man tprep.man > $(MAN)tprep
$(MAN)wdlen: wdlen.man
	nroff -man wdlen.man > $(MAN)wdlen
$(MAN)wheel: wheel.man
	nroff -man wheel.man > $(MAN)wheel
$(MAN)xref: xref.man
	nroff -man xref.man > $(MAN)xref
@@@ Fin de man/Makefile
echo man/README
cat >man/README <<'@@@ Fin de man/README'
This file contains nroff/troff text for use with the -man
macro package (version 7 Unix only). To get printable manual pages
in the directory "../man", where users can read them by using the
"human" command, just use the "make" command in this directory.
This is very similar to the "make" for the C source code, except
that this "Makefile" calls "nroff -man".
@@@ Fin de man/README
echo man/accent.man
cat >man/accent.man <<'@@@ Fin de man/accent.man'
accent - user-controlled accent module
cedilla - convert plus mark to cedilla
umlaut - convert plus mark to umlaut

accent [ -a accfile ] [ filename ... ]
cedilla [ filename ... ]
umlaut [ filename ... ]
For convenience, two links to accent are provided: cedilla and umlaut. Cedilla converts the plus mark to a backspace and comma, which looks like a cedilla on the lineprinter; this convention is used even on the phototypesetter. Umlaut converts the plus mark to a backspace and double quote; this passes for an umlaut on unsophisticated output devices. The plus mark should follow the character that is to be accented; for example, type "Provenc+al" or "Mu+llerin" in your text. The use of a plus character to represent both accent marks implies that you cannot have cedillas and umlauts in the same text, unless you use the accent program.
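The plus-mark convention described above can be sketched in a few lines of Python. This only illustrates the substitution the text describes (plus mark becomes backspace plus comma, or backspace plus double quote); the function names are hypothetical and this is not the source of the distributed programs.

```python
BACKSPACE = "\b"

def cedilla(text: str) -> str:
    """Replace each plus mark with backspace + comma, overstriking a cedilla."""
    return text.replace("+", BACKSPACE + ",")

def umlaut(text: str) -> str:
    """Replace each plus mark with backspace + double quote, overstriking an umlaut."""
    return text.replace("+", BACKSPACE + '"')

if __name__ == "__main__":
    # repr() makes the otherwise invisible backspace visible.
    print(repr(cedilla("Provenc+al")))
    print(repr(umlaut("Mu+llerin")))
```

Because both filters consume the same plus mark, running either one over a text rewrites every marked letter the same way, which is why mixed cedillas and umlauts need the accent program instead.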
Be sure to use the + option of kwic or kwal, or have a second line of zero-width characters in the punctuation file, in order to create the proper character alignment. Accent, cedilla, or umlaut should be called after sort, but before format. If you are using these programs, invoke the -d flag of sort to prevent accent marks from influencing dictionary order.
cfreq [ -p -a -d -m - ] filename ...
    -p: list all printable characters (blank - '~')
    -a: list all ascii characters (null - delete)
    -d: count digraphs rather than single characters
    -m: disable mapping of digraphs to lower case
    - : read standard input instead of files
Ordinarily, only alphabetic characters are listed. The -p option, however, gives all printable characters, including letters, digits, and punctuation marks. The -a option gives all 128 ascii characters, including control characters, letters, digits, and punctuation marks.
When given the -d flag, cfreq counts digraph frequencies. Spaces, tabs and newlines are considered valid characters, as are punctuation marks, digits, and control characters. When reading is finished, the digraphs are listed on the left, with the frequency counts on the right. If the -m flag is also invoked, cfreq will not map alphabetic characters to lower case, so you will end up with capitals among the digraphs.
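As a rough illustration of digraph counting, here is a minimal Python sketch. It assumes that digraphs are overlapping pairs of adjacent characters and that mapping to lower case happens before counting; neither detail is stated in the text, and `digraph_counts` is a hypothetical name.

```python
from collections import Counter

def digraph_counts(text: str, map_lower: bool = True) -> Counter:
    """Count adjacent character pairs; spaces and punctuation count too."""
    if map_lower:
        # Default behaviour; map_lower=False plays the role of the -m flag.
        text = text.lower()
    # Assumption: pairs overlap, so "aba" yields "ab" and "ba".
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

if __name__ == "__main__":
    for pair, count in sorted(digraph_counts("Abab").items()):
        print(pair, count)
```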
dict [ - ] filename [ outfileroot ]
    -: read standard input rather than file
Theoretically, it is possible to write 128 different files, one for each ascii character. This means that each number goes into its own file, and that an upper case "A" and a lower case "a" will end up in different files. In the case of the kwic program, all keywords are already mapped to lower case, so there should be 26 or fewer files. Here is an example of a concordance program using kwic:
% kwic text* | sort | dict - /tmp/OUT
% edit /tmp/OUT*
% format /tmp/OUT* | lpr
% rm /tmp/OUT*

In the above example, dict makes small files out of one large file, so that you can edit the concordance until you are happy with it. The best and most useful concordances are always hand-edited.
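The splitting step can be pictured with the following Python sketch. It groups lines by their first character and returns the groups keyed by a section name; the naming scheme (output root followed by the initial character) is an assumption suggested by the `/tmp/OUT*` pattern in the example, not something the text confirms.

```python
from collections import defaultdict

def split_by_initial(lines, outfileroot="OUT"):
    """Group lines into sections keyed by outfileroot + first character."""
    sections = defaultdict(list)
    for line in lines:
        if line:  # skip empty lines, which have no first character
            # Assumed naming scheme: root plus the initial character.
            sections[outfileroot + line[0]].append(line)
    return dict(sections)

if __name__ == "__main__":
    sample = ["apple|ctx", "ant|ctx", "bee|ctx"]
    for name, members in split_by_initial(sample, "/tmp/OUT").items():
        print(name, len(members))
```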
    -i: ignorefile contains words to be ignored, one per line
    -o: onlyfile has only words to be printed, one per line
Ordinarily, words to be ignored are read from ``exclfile'', but another ignore file can be specified after the -i option. (There is a list of common English words in /usr/lib/eign.) If you wish to preserve only a small set of words, and want all other words ignored, you can list these important words in the only file, and use the -o option; only words listed in that file will be sent through the filter. Words listed in the exclude file must be on a line of their own, with no blanks anywhere on the line.
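The two filtering modes can be sketched as follows. The sketch assumes each concordance line begins with the keyword, padded with blanks and terminated by a vertical bar, as the kwic and kwal pages describe; the function name and arguments are hypothetical.

```python
def exclude(lines, ignore=None, only=None):
    """Filter concordance lines by keyword.

    ignore: set of words to drop (the default mode, or -i).
    only:   set of words to keep, dropping all others (-o).
    """
    kept = []
    for line in lines:
        # Assumption: the keyword is everything before the first bar.
        keyword = line.split("|", 1)[0].strip()
        if only is not None:
            if keyword in only:
                kept.append(line)
        elif ignore is None or keyword not in ignore:
            kept.append(line)
    return kept

if __name__ == "__main__":
    sample = ["the            |the sword fell", "sword          |the sword fell"]
    print(exclude(sample, ignore={"the"}))
```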
Exclude should be used after kwic or kwal, but before sort, because eliminating unnecessary words before sorting will save large amounts of otherwise redundant machine time. Here is a sample command line using the exclusion routine:
% kwic textfile | exclude | sort | format

Of course, it is necessary to have words to be excluded in a file called ``exclfile'', residing in the same directory as ``textfile''. Eliminating prepositions and articles from a concordance can often shorten it by as much as one-third to one-half.
format [ -mck ] [ filename ... ] [ - ]
    -m: keywords not mapped from lower to upper case
    -c: suppress counting of keywords (will speed it up)
    -k: suppress printing of separate keyword
    - : read standard input instead of files
If for some reason you do not want an upper case keyword heading, you can preserve the lower case keywords by using the -m option. Keyword counting can also be suppressed by using the -c option; this will speed up the format program somewhat. To completely suppress printing of a separate keyword, use the -k option; this will produce only the identification field and the context.
Here is a typical program sequence for a concordance, suitable for sending to the lineprinter:
% kwic -c100 filename(s) | sort | format | lpr

The -c100 argument to kwic creates a long context suitable for the lineprinter.
freq [ -n -m -dpfile - ] filename ...
    -n: list words in numerical order of frequency
    -m: disable mapping of letters to lower case
    -d: define punctuation set according to pfile
    - : read standard input instead of files
The -n option will cause the words to be listed by numerical order of frequency, with the most common words first. The -m flag will leave capital letters as they are. The -d option allows the user to define his own punctuation set. If this option is called, freq will replace the default punctuation set ,.;:-?!"()[]{} with the last line of the specified file.
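A word count along these lines might look like the Python sketch below. It uses the default punctuation set quoted above and assumes that punctuation marks act as word separators and that ties in the numerical listing fall back to alphabetical order; both are assumptions, and `word_freq` is a hypothetical name.

```python
from collections import Counter

DEFAULT_PUNCT = ',.;:-?!"()[]{}'  # default set quoted in the text

def word_freq(text, punct=DEFAULT_PUNCT, map_lower=True, numeric=False):
    """Return (word, count) pairs, alphabetically or by descending count."""
    if map_lower:
        text = text.lower()
    # Assumption: every punctuation mark separates words, like a blank.
    table = str.maketrans(punct, " " * len(punct))
    counts = Counter(text.translate(table).split())
    if numeric:
        # Most common first; alphabetical order breaks ties (assumed).
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(counts.items())

if __name__ == "__main__":
    for word, count in word_freq("The cat; the dog.", numeric=True):
        print(count, word)
```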
kwal [ -kn -m -wS -fn -sn -r -ln -x -dF + - ] filename ...
    -kn: keyword is n characters long (defaults to 15)
    -m : keywords not mapped from upper to lower case
    -wS: write string S onto id field (use quotes around blanks)
    -fn: filename (up to n characters) written onto id field
    -sn: skip n characters of lefthand id field in text and write as id
    -r : reset linenumber to 1 at beginning of every file
    -ln: line numbering begins with line n (instead of 1)
    -x : line numbering is suppressed entirely
    -d : define punctuation set according to file F
    +  : the + character indicates cedilla or umlaut
    -  : read text from standard input (terminal or pipe)
By default, only the first 15 characters of the keyword are printed, followed by a vertical bar; longer keywords are truncated. If you want more or less than 15 characters in the keyword, use the -k option to lengthen or shorten it. To find the longest word in your text, try the maxwd program, and set -k accordingly. You can also use maxwd -l to determine the length of your longest context line. Keywords are mapped to lower case to ease the logistics of sorting, unless the -m option is specified.
The -w argument allows you to write an id field (such as the name of an author or work) after the keyword. If you want to include any blanks, enclose the entire string in quotes: -w"Poetic Edda". The -f argument allows you to write the current filename, up to a number of characters you specify. If the filename is shorter, it will be blank-padded, and if it is longer, it will be truncated.
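The fixed-width fields described above can be illustrated with a short Python sketch. Blank-padding and truncation of the filename follow the text; padding short keywords out to the full width before the vertical bar is an assumption, and both function names are hypothetical.

```python
def keyword_field(word, k=15, map_lower=True):
    """Keyword cut to k characters and followed by a vertical bar."""
    if map_lower:
        word = word.lower()
    # Assumption: short keywords are blank-padded so the bars line up.
    return word[:k].ljust(k) + "|"

def filename_field(filename, n):
    """Filename blank-padded or truncated to exactly n characters."""
    return filename[:n].ljust(n)

if __name__ == "__main__":
    print(keyword_field("Yggdrasil", 5) + filename_field("voluspa", 4))
```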
If you are concording a series of short poems, each starting with line 1, type them into separate files, and use the -r option to reset the linenumber to 1 at the beginning of each new file. If you resume concording in the middle of your text, you can set the line number with the -l option. If your text is already numbered or identified, with a system that is not entirely arithmetic, such as by hemistich or by double lines, you can print your custom id field by using the -s option. This will skip over n characters of your lefthand id field embedded in the text, and print it as an id field, after the (-f) filename, but before the (-l) linenumber. When you also want to suppress linenumbering, use the -x option.
If you are working with a foreign language, and need to use normal punctuation marks as diacritical marks, you can change the default punctuation set with the -d option. Just type the punctuation marks you want into a file, on a single line with no embedded spaces, and specify the filename after the -d in your command line. If you have cedillas or umlauts, you can represent them as a `+' character after the accented letter. Use the `+' option of kwal, and filter your output through either the cedilla or umlaut program.
After generating the concordance, it should be alphabetized using the Unix sort program. Keywords should be grouped and counted with the format program, and the final results can be sent to the lineprinter. Here is a typical program sequence for generating a concordance:
% kwal poem* | sort | format | lpr

Usually, it is better to send the results of format to a file, where they can be examined and edited, before sending the file to the lineprinter.
kwic [ -kn -m -wS -fn -r -ln -pn -ic -cn -dF + - ] filename ...
    -kn: keyword is n characters long (defaults to 15)
    -m : keywords not mapped from upper to lower case
    -wS: write string S onto id field (use quotes around blanks)
    -fn: filename (up to n characters) written onto id field
    -r : reset linenumber to 1 at beginning of every file
    -ln: line numbering begins with line n (instead of 1)
    -pn: page numbering begins with page n (instead of 1)
    -ic: page incrementer is character c (defaults to =)
    -cn: context is n characters long (defaults to 50)
    -dF: define punctuation set according to file F
    +  : the + character indicates cedilla or umlaut
    -  : read text from standard input (terminal or pipe)
By default, only the first 15 characters of the keyword are printed, followed by a vertical bar; longer keywords are truncated. If you want more or less than 15 characters in the keyword, use the -k option to lengthen or shorten it. To find the longest word in your text, use the maxwd program, and set -k accordingly. Keywords are mapped to lower case to ease the logistics of sorting, unless the -m option is specified.
The -w argument allows you to write an id field (such as the name of an author or work) after the keyword. If you want to include any blanks, enclose the entire string in quotes: -w"Prose Edda". The -f argument allows you to write the current filename, up to a number of characters you specify. If the filename is shorter, it will be blank-padded, and if it is longer, it will be truncated.
If the program encounters the character "=", which, by default, indicates pagination, it will count pages as well as line numbers. Line numbers will print as: `` 12469'', while page numbers will print as: ``178,12''. If you are concording a series of short poems, each starting with line 1, type them into separate files, and use the -r option to reset the linenumber to 1 at the beginning of each new file. If you resume concording in the middle of your text, you can set the line number with the -l option, or the page number with the -p option. If you want to indicate pagination, make sure that you begin your text with ``=1'', on a line of its own, to indicate the first page. When a new chapter starts at the top of the page, be sure to set -p to the previous page. The page indicator can be changed with the -i option; -i% will change it to a percent sign, for instance.
If you are sending output to the lineprinter, the context width can be increased with the -c argument; -c110, for instance, will give you about 55 characters on either side of the keyword in context. Note that the lineprinter can print only 132 characters per line, so add up your field widths carefully.
If you are working with a foreign language, and need to use normal punctuation marks as diacritical marks, you can change the default punctuation set with the -d option. Just type the punctuation marks you want into a file, on a single line with no embedded spaces, and specify the filename after the -d in your command line. If you have cedillas or umlauts, you can represent them as a `+' character after the accented letter. Use the `+' option of kwic, and filter your output through either the cedilla or umlaut program.
After generating the concordance, it should be alphabetized using the Unix sort program. Keywords should be grouped and counted with the format program, and the final results can be sent to the lineprinter. Here is a typical program sequence for generating a concordance:
% kwic -c110 chapter* | sort | format | lpr

Usually, it is better to send the results of format to a file, where they can be examined and edited, before sending the file to the lineprinter.
lno [ +n -d -h -sn - ] filename ...
    +n : the beginning line number is n, not 1
    -d : double line number text with long lines
    -h : hemistich number text with split lines
    -sn: number and letter strophes of n lines
    -  : read standard input instead of files
With the -d option, lno numbers a text with long (Germanic) lines, which are generally labelled in editions as double lines. It starts at 1 (one), or at the specified line number, and goes up in increments of two at the end of each line.
With the -h option, lno numbers a text with hemistichs, or half lines. It starts at 1 (one), unless another beginning number is specified. The first line is labelled 1a, the second 1b, the third 2a, the fourth 2b, and so forth.
The -s option can be used to specify the number of lines in a strophe. For example, -s4 will produce 1a, 1b, 1c, 1d, 2a, and so on. The -h option is identical to the -s2 argument.
With the -d option, if you specify an even beginning number, all the following double line numbers will be even. With the -h option, all line pairs have a number postfixed with "a" and then "b", so if you want to begin with a "b", put an empty line in your text, to be labelled "a".
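The numbering schemes above are easy to check with a small Python sketch. It reproduces only the labels given in the text (1a, 1b, 2a for hemistichs; steps of two for double lines); the function names are hypothetical, and strophes longer than 26 lines are not handled.

```python
from string import ascii_lowercase

def strophe_labels(n_lines, per_strophe=2, start=1):
    """Labels such as 1a, 1b, 2a; per_strophe=2 corresponds to -h."""
    labels = []
    for i in range(n_lines):
        number = start + i // per_strophe          # strophe number
        letter = ascii_lowercase[i % per_strophe]  # position within it
        labels.append(f"{number}{letter}")
    return labels

def double_line_numbers(n_lines, start=1):
    """Numbers for -d: the count goes up by two at each line."""
    return [start + 2 * i for i in range(n_lines)]

if __name__ == "__main__":
    print(strophe_labels(5, per_strophe=4))
    print(double_line_numbers(3, start=2))
```

An even starting number keeps every double line number even, as the text notes, because the step is always two.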
maxwd [ -l -dF - ] filename ...
    -l: look for longest line instead of longest word
    -d: define punctuation set according to file F
    - : read standard input instead of files
With the -l option, maxwd will look for the longest line, and print its filename, linenumber, and length. The next line of output contains the longest line itself, verbatim.
If several files are concatenated and sent through a pipe to maxwd, the filename will appear as "Stdin" and line numbering will continue to increment across file boundaries.
Maxwd should be used before concording a text with kwic or kwal, in order to determine what keyword length you should specify. If you are working with foreign languages, the -d option can be used to split words at the proper place; the punctuation file is compatible with many other related programs.
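Finding the longest word, so that -k can be set to match, can be sketched like this. The sketch assumes the same default punctuation set that the freq page quotes and keeps the first word found when several share the maximum length; both choices are assumptions, and `longest_word` is a hypothetical name.

```python
def longest_word(lines, punct=',.;:-?!"()[]{}'):
    """Return (word, line number) for the longest word in the text."""
    # Assumption: punctuation splits words exactly as blanks do.
    table = str.maketrans(punct, " " * len(punct))
    best_word, best_line = "", 0
    for lineno, line in enumerate(lines, start=1):
        for word in line.translate(table).split():
            if len(word) > len(best_word):  # strict: first longest wins
                best_word, best_line = word, lineno
    return best_word, best_line

if __name__ == "__main__":
    word, lineno = longest_word(["Hear me,", "all holy kindreds"])
    print(word, lineno, len(word))
```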
pair [ -m ] file1 [ - ] file2 [ +len1 [ +len2 ] ]
    -m: merge (intercalate) files line by line
    - : read standard input instead of files
    len1 and len2 denote screen width of file1 and file2
By default, pair prints two 40-character wide columns of text, which gives equal space to each text, and fills up the screen. The third and fourth arguments can be used to change the column width for the first and second files, respectively. For example, if your first file is composed of numbers but your second file contains text with occasional long lines, specify something like:
% pair file1 file2 +10 +70

If you have long lines and would rather have lines from each text on separate lines, use the -m option.
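The side-by-side and merged layouts can be sketched in Python as follows. The column widths default to 40 as stated above; cutting over-long lines to the column width is an assumption, and the function names are hypothetical.

```python
from itertools import zip_longest

def pair(left, right, len1=40, len2=40):
    """Set two lists of lines side by side in fixed-width columns."""
    rows = []
    for a, b in zip_longest(left, right, fillvalue=""):
        # Assumption: lines wider than their column are cut off.
        rows.append(a[:len1].ljust(len1) + b[:len2])
    return rows

def merge(left, right):
    """Intercalate two lists of lines, one line from each in turn (-m)."""
    out = []
    for a, b in zip_longest(left, right):
        if a is not None:
            out.append(a)
        if b is not None:
            out.append(b)
    return out

if __name__ == "__main__":
    for row in pair(["1", "2"], ["first line", "second line"], 10, 70):
        print(row)
```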
Pair can be used for comparing textual variants. It is especially useful for making two texts parallel before analyzing the variants with diff or diff3. Diff compares two files, while diff3 compares three at a time. The results from these programs will be more usable if the texts are parallel before they are analyzed.
If your text proceeds in one language, and then changes to another for a quote, just put a ctrl-p in your text between sections. The terminal will pause until you change the printing device, and when you are ready to continue, you can type ctrl-d on the terminal.
Revconc should be used within a pipeline that includes kwic or kwal, sort, and format. Here is a suggested command sequence:
% kwic filename(s) | revconc | sort | revconc | format

It must be used twice, or else the word will appear backwards in the final version. The first invocation of revconc reverses the keyword, so that sort operates from the back to the front, while the second invocation restores normal order to the word.
Many published concordances contain a Reverse List of Graphic Forms; revconc can be used for this purpose, but the Unix utility rev would probably be faster. Here is a suggested command sequence for making a Reverse List of Graphic Forms:
% prep filename(s) | rev | sort -u | rev

The results can be put into columns with the Unix utility pr.
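The reverse, sort, reverse idea can be sketched in Python. This mirrors the `rev | sort -u | rev` pipeline on a plain word list and is an illustration only; `reverse_list` is a hypothetical name.

```python
def reverse_list(words):
    """Order unique words by their endings: reverse, sort, reverse back."""
    reversed_forms = {w[::-1] for w in words}          # rev, duplicates dropped
    return [r[::-1] for r in sorted(reversed_forms)]   # sort -u, then rev

if __name__ == "__main__":
    print(reverse_list(["sing", "ring", "walked", "talked", "ring"]))
```

Words that share an ending come out next to each other, which is what makes such a list useful for studying suffixes and rhymes.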
sfind [ -sC -ln -pn -ic -r ] 'pattern' [ - ] filename ...
    -sC: record separator set to C (or empty line with no C)
    -ln: line number is set to n (instead of 1)
    -pn: page number is set to n (default off)
    -ic: page incrementing character is c (not =)
    -r : reset linenumber to 1 with each new file
    -  : read standard input instead of files
The pattern wildcard character `_' (underscore) matches any single character; it is similar to the `.' (period) in grep, or the `?' (question mark) in the shell. The wildcard character `*' (asterisk) matches any number of characters in your text until the pattern continues; it is exactly like the `*' wildcard in the shell. It is also similar, but not identical, to the `*' in grep, which matches zero or more repetitions of the previous character. To find an actual underscore or asterisk, precede these metacharacters with a backslash.
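One way to picture these wildcards is to translate an sfind-style pattern into a regular expression, as in the Python sketch below. It assumes `*` matches as few characters as possible before the pattern continues; the translation is an illustration, not the matching code of sfind itself, and `pattern_to_regex` is a hypothetical name.

```python
import re

def pattern_to_regex(pattern):
    """Translate `_`, `*` and backslash escapes into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character literal.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "_":
            parts.append(".")      # any single character
        elif ch == "*":
            parts.append(".*?")    # any run of characters (assumed shortest)
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)

if __name__ == "__main__":
    print(bool(pattern_to_regex("w_rd").search("a word here")))
    print(pattern_to_regex("k*g").search("the king").group())
```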
If you begin searching in the middle of a text, you can set the beginning line number (or page number) with the -l (or -p) option. For compatibility with the page incrementing feature of kwic, sfind will count pages if it encounters `=' (equals) in the text. The incrementing character can be changed with the -i option. If you want to reset the linenumber to 1 at the beginning of each new file, use the -r option.
The -s option is for use with databases where records are separated by a record separator. This character can be specified after the -s, and the program will operate a record at a time, rather than a sentence at a time. If the record separator is a magic shell character, it will have to be quoted or escaped with a backslash. A -s alone indicates that records are separated by a blank line, as are records in refer bibliographies. It is similar to the -F option of awk.
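Record-at-a-time operation can be sketched as a simple splitting step. The Python below treats a missing separator as "records are separated by a blank line" and otherwise splits on the given character; stripping surrounding white space from each record is an assumption, and `split_records` is a hypothetical name.

```python
import re

def split_records(text, separator=None):
    """Split text into records on a separator, or on blank lines if none."""
    if separator is None:
        chunks = re.split(r"\n\s*\n", text)   # a bare -s: blank-line records
    else:
        chunks = text.split(separator)        # -sC: records end at character C
    # Assumption: surrounding white space is not part of a record.
    return [c.strip() for c in chunks if c.strip()]

if __name__ == "__main__":
    print(split_records("author one\ntitle one\n\nauthor two\ntitle two\n"))
```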
It is possible to escape to a system editor, from which you can easily correct mistakes, by giving a ``tilde escape'' on the data line. The tilde must be the first character on the input line. In the distributed program, ~v will escape to vi, and ~e will escape to ex; both editors are part of 2bsd and 4bsd (Berkeley Software Distribution). If you don't have these editors, simply change the code and recompile the program so it will work with ed, or your own favorite editor.
The -2 option will cause output to be double spaced; -3 will cause triple spacing, and so forth. This is a substitute for the .ls 2 of nroff/troff, or the .nr VS 24 of the -ms macros. The -h option is used to print a header at the top of each page; it only works if pagination is in effect. The -s flag suppresses the shifting to the right.
The tosel program works much like the Unix utility cat. That is to say, it can be used to print out one or more files, or as a filter in a series of programs communicating by pipes. It can be used before or after pause, since it does nothing to Control-P.

< > | ^

The circumflex character produces a blank space. These five characters are not rendered accurately:

[ { ] } `

They produce, in order, these five similar characters:

( ( ) ) '

Of course, various IBM balls will differ, and will cause further program bugs. The tosel program was written for the IBM "Pica 72" 10-pitch ball, but will probably work perfectly for any ball that has the characters `!' and `1', and 1/2 and 1/4.
@@@ Fin de man/tosel.man
echo man/tprep.man
cat >man/tprep.man <<'@@@ Fin de man/tprep.man'
tprep [ -y -tpu ] filename ...
    -y: say yes and suppress interactive prompting
    -t: trim lines, removing trailing blanks and tabs
    -p: pad, inserting blank at beginning of each line
    -u: unpad, deleting blank at beginning of each line
When typing in a text, it is practically impossible to avoid accidental spaces at the end of lines. These spurious blanks throw off the results of character counting, and are unsightly in a kwic-style concordance. Also, before compiling a kwic concordance, you may want to pad each line with a blank, so that the slash indicating newline is not followed too closely by the next word. After finishing the concordance, the padding can be removed, using the unpad option.
If you do not specify any options in the command line, you are prompted to make sure you want to rewrite your files. Then you are asked whether you want to use trim, pad or unpad. You can answer either with the full word, or with the first letter of these three words. Tprep also tells what files it is rewriting, and reports on the scope of the changes involved for each file.
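The three operations are simple enough to state as one-line Python functions. This is a sketch of the line transformations only; the real tprep rewrites files in place and prompts the user, which is not modelled here, and the function names are hypothetical.

```python
def trim(line):
    """Remove trailing blanks and tabs (-t)."""
    return line.rstrip(" \t")

def pad(line):
    """Insert one blank at the beginning of the line (-p)."""
    return " " + line

def unpad(line):
    """Delete one leading blank, if there is one (-u)."""
    return line[1:] if line.startswith(" ") else line

if __name__ == "__main__":
    print(repr(trim("Hear me, all holy kindreds \t ")))
    print(repr(unpad(pad("Hear me"))))
```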
troffmt [ -ckm ] [ filename ... ] [ - ]
    -c: suppress counting of keyword frequency
    -k: entirely suppress printing of keyword
    -m: do not supply concordance macros automatically
    - : read standard input instead of files
Keyword counting can be suppressed by using the -c option; this will speed up the program somewhat. To completely suppress printing of a separate keyword, use the -k option.
Here is a typical program sequence for a concordance, suitable for sending to the typesetter:
% kwic -f5 -c80 filename(s) | sort | troffmt | troff -Q

The -c80 argument to kwic creates a context suitable for the typesetter. Anything larger may result in lines too long for the typesetter. If there is no -f or -w option, -c85 would be safe; with long -f or -w options, adjust -c accordingly.
wdlen [ -l -dPfile - ] filename ...
    -l: print long histogram suitable for lineprinter
    -d: define punctuation set according to Pfile
    - : read standard input instead of files
If there are a great number of words in your text, the dashes in the bar graph do not have a one to one correspondence with the frequency count, but are calculated so that the longest bar fills up the screen. The length of the bar can be extended with the -l option.
If you are working with foreign languages, the -d option can be used to split words at the proper place; the ``Pfile'' is compatible with many other related programs.
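The scaling of the bar graph can be sketched as follows. Bars are drawn one dash per word until the largest count exceeds the available width, after which every bar is scaled so the longest one just fills it; the width of 60 and the output layout are assumptions, and `length_histogram` is a hypothetical name.

```python
from collections import Counter

def length_histogram(words, width=60):
    """Return rows of 'length count bar', scaling bars to fit width."""
    counts = Counter(len(w) for w in words)
    if not counts:
        return []
    peak = max(counts.values())
    rows = []
    for length in sorted(counts):
        count = counts[length]
        if peak <= width:
            bar_len = count                           # one dash per word
        else:
            bar_len = max(1, count * width // peak)   # longest bar == width
        rows.append(f"{length:2d} {count:5d} " + "-" * bar_len)
    return rows

if __name__ == "__main__":
    for row in length_histogram("hear me all holy kindreds".split()):
        print(row)
```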
wheel [ +n -m -dF - ] filename ...
    +n: print clusters of n words (default 2)
    -m: do not map upper case to lower case
    -d: define punctuation set according to file F
    - : read standard input instead of files
After extracting all the word clusters in your text, they can be sorted and counted to find repeated patterns. Here is an example of a command line to accomplish this:
% wheel +3 text | sort | uniq -c

Of course, sort can be applied to any field desired; ``sort +2'' refers to the third word on each line. It would be good to analyze syntactic clusters of two, three, four, and possibly more words apiece. British scholars use the cumbersome term ``collocation'' to mean word cluster.
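The cluster extraction and the `sort | uniq -c` step can be imitated in Python. The sketch assumes that clusters overlap, advancing one word at a time, and that punctuation separates words; both are assumptions, and `clusters` is a hypothetical name.

```python
from collections import Counter

def clusters(text, n=2, map_lower=True, punct=',.;:-?!"()[]{}'):
    """Return every run of n consecutive words in the text."""
    if map_lower:
        text = text.lower()
    words = text.translate(str.maketrans(punct, " " * len(punct))).split()
    # Assumption: the window slides one word at a time, so clusters overlap.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

if __name__ == "__main__":
    counts = Counter(clusters("To be, or not to be", n=2))
    for cluster, count in sorted(counts.items()):
        print(count, cluster)
```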
xref [ -r -ln -pn -ic -wn -dF - ] filename ...
    -r : reset linenumber to 1 at beginning of every file
    -ln: line numbering begins with line n (instead of 1)
    -pn: page numbering begins with page n (instead of 1)
    -ic: page incrementer is character c (defaults to =)
    -wn: width of output page is n (defaults to 80)
    -d : define punctuation set according to file F
    -  : read text from standard input (terminal or pipe)
If you are cross referencing a number of short texts, you can reset the linenumber to 1 with the -r option. Line number and page number can be set with the -l and -p options. The default pagination character is the equals sign; if you have another page indicator, it can be set with the -i option. In case your text has equals signs that do not indicate a new page, you could use the -i option without a character afterwards, and page labelling will not occur.
Xref will also read a user-definable punctuation set from the file specified after the -d option. It can also read from standard input. Most importantly, the output width can be set with the -w option. For example, to send a cross reference index to the lineprinter, a -w130 is recommended. The default page width is 80, which is appropriate for a CRT terminal or for regular paper.
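A cross reference of words and line numbers can be sketched as an index from each word to the lines where it occurs. The Python below maps words to lower case and lists each line only once per word; both choices are assumptions, page numbering and output width are not modelled, and `cross_reference` is a hypothetical name.

```python
from collections import defaultdict

def cross_reference(lines, start=1, punct=',.;:-?!"()[]{}'):
    """Map each word to the sorted list of line numbers where it occurs."""
    table = str.maketrans(punct, " " * len(punct))
    index = defaultdict(list)
    for lineno, line in enumerate(lines, start=start):
        for word in line.lower().translate(table).split():
            if lineno not in index[word]:   # list a line once per word
                index[word].append(lineno)
    return dict(sorted(index.items()))

if __name__ == "__main__":
    for word, numbers in cross_reference(["The cat", "the dog", "a cat"]).items():
        print(word, *numbers)
```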